Document page decomposition by the bounding-box project

Authors

  • Jaekyu Ha
  • Robert M. Haralick
  • Ihsin T. Phillips
Abstract

This paper describes a method for extracting words, textlines and text blocks by analyzing the spatial configuration of the bounding boxes of connected components in a given document image. The basic idea is that connected components of black pixels can serve as the computational units of document image analysis. The problem of extracting words, textlines and text blocks is viewed as a clustering problem in the 2-dimensional discrete domain. The main strategy is to use profiling analysis to measure the horizontal or vertical gaps between components (or groups of components) during image segmentation. For this purpose, we compute the smallest rectangular box, called the bounding box, that circumscribes each connected component. These boxes are projected horizontally and/or vertically, and local and global projection profiles are analyzed for word, textline and text-block segmentation. In the last step of segmentation, the document decomposition hierarchy is produced from these segmented objects.
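
As a rough illustration of the projection idea in the abstract, the sketch below projects bounding boxes horizontally (accumulating them onto the y-axis) and cuts the resulting profile at empty runs to group boxes into textlines. It is not the authors' implementation: the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the fixed `min_gap` threshold are all assumptions made for this example, and the abstract does not say how gap thresholds are actually chosen.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max), inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def horizontal_projection_profile(boxes: List[Box], height: int) -> List[int]:
    """Project boxes horizontally: profile[y] counts the boxes covering row y."""
    profile = [0] * height
    for _, y_min, _, y_max in boxes:
        for y in range(y_min, y_max + 1):
            profile[y] += 1
    return profile


def split_textlines(boxes: List[Box], height: int, min_gap: int = 1) -> List[List[Box]]:
    """Group boxes into textlines by cutting the profile at empty runs.

    A run of at least `min_gap` empty rows (hypothetical threshold) ends a band.
    """
    profile = horizontal_projection_profile(boxes, height)
    bands = []          # (top, bottom) row ranges of candidate textlines
    start = None        # first row of the band currently being scanned
    end = 0             # last non-empty row seen in the current band
    gap = 0             # length of the current run of empty rows
    for y, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = y
            end = y
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                bands.append((start, end))
                start = None
    if start is not None:
        bands.append((start, end))
    # Assign each box to the band that fully contains its vertical extent.
    return [
        [b for b in boxes if top <= b[1] and b[3] <= bottom]
        for top, bottom in bands
    ]


if __name__ == "__main__":
    example = [(0, 0, 4, 9), (6, 1, 10, 8), (0, 15, 5, 24), (7, 16, 12, 23)]
    for i, line in enumerate(split_textlines(example, height=30)):
        print(i, line)
```

The same routine applied along the x-axis within a single textline would give a comparable sketch for word segmentation, with the gap threshold again left as an assumption.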

Similar resources

Attend Refine Repeat: Active Box Proposal Generation via In-Out Localization

The problem of computing category agnostic bounding box proposals is utilized as a core component in many computer vision tasks and thus has lately attracted a lot of attention. In this work we propose a new approach to tackle this problem that is based on an active strategy for generating box proposals that starts from a set of seed boxes, which are uniformly distributed on the image, and then...

Finding the Best-Fit Bounding-Boxes

The bounding-box of a geometric shape in 2D is the rectangle with the smallest area in a given orientation (usually upright) that completely contains the shape. The best-fit bounding-box is the smallest bounding-box among all the possible orientations for the same shape. In the context of document image analysis, the shapes can be characters (individual components) or paragraphs (component groups...

Document Layout Structure Extraction Using Bounding Boxes of Different Entities

This paper presents an efficient and accurate technique for document page layout structure extraction and classification by analyzing the spatial configuration of the bounding boxes of different entities on a given image. The text, table, and nontext structures are detected on document images. The text-lines and words are extracted and the tabular structure is further decomposed into row and column ...

Word Segmentation for Document Images by Successively Merging Adjacent Character Bounding Boxes by Iterative Dilation

A new method of word segmentation for document images is presented. The method uses the bounding box regions to enclose the letters (characters) of the words and then the resulting letter spaces are progressively filled to merge the character bounding boxes to get the word bounding boxes. The method holds good for inclined and irregularly distributed words. The proposed method completely avoids...

Notes on Binary Dumbbell Trees

Dumbbell trees were introduced in [1]. A detailed description of non-binary dumbbell trees appears in Chapter 11 of [3]. These notes show how binary dumbbell trees can be obtained, and how they can be used to construct, in O(n log n) time, a spanner of bounded degree and weight proportional to O(log n) times the weight of a minimum spanning tree. The reader is assumed to be familiar with the sp...

Publication date: 1995